feat: Ignore DPUs, so that they can be externally managed #85
vinodchitraliNVIDIA wants to merge 1 commit into NVIDIA:main from
Conversation
Force-pushed: 2a94f1f to 68e2edc
🛡️ Vulnerability Scan: 🚨 Found 30 vulnerabilities.
💡 Note: Enable GitHub Advanced Security to see full details in the Security tab.
🔐 TruffleHog Secret Scan: ✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉
🛡️ CodeQL Analysis: ✅ No security issues found!
@vinodchitraliNVIDIA could you make sure to write up a description of your changes? This is hard to review because I have zero context for what this change is for. Particularly, why is this change needed? When doing zero-DPU testing we tested with hosts that had onboard NICs and it worked fine. I'm confused why there need to be any changes here (I don't doubt there do, but I just want to understand what's lacking in the current code.)
In the zero-DPU case, there is physically no DPU on the host. In our use case, DPU(s) physically exist on the host: they are assigned a machine interface, but they are not made part of the Managed Host, because DPU configuration is skipped. The use case looks more or less the same as the zero-DPU case, except that DPU(s) may or may not exist.
Force-pushed: 68e2edc to 30d4a00
we should find a way to support the use-case without building yet another feature. I'm concerned that even the zero-DPU path has no docs and has barely seen any testing. This has even less (no unit tests), so it's very likely to just be forgotten and break. Any chance we can get this aligned with regular zero-DPU? E.g. remove the DPUs from the target hosts, or somehow set them in NIC mode and then just use the NIC that the host boots from? I also feel like this change might just be the very beginning of supporting such hosts: we'd also need to look at the inventory path, SKUs and SKU validation, software updates, etc.
@vinodchitraliNVIDIA do you mean the DPUs are in NIC mode? Because the zero-DPU case already covers this (if we discover a host and its DPU is in NIC mode, it's treated as a zero-DPU host; we even have integration tests for this). Or do you mean these are DPF-managed hosts or something else, where we see a DPU, it's not in NIC mode, but carbide opts not to manage the DPU in favor of something else managing it? If so, it's probably worth (a) spelling that out in the PR description, and (b) renaming some of the things here to indicate "externally managed DPU" instead of "onboard NIC"
DPUs can be in any mode. Let me think through "externally managed DPU". Initially I used
The current zero-DPU code path assumes that the host will not have an attached DPU, @kensimon correct me if I am wrong. In our case DPUs will be attached to the host, and carbide DHCP will assign them IPs, but they are not configured in the managed host. Let me add a few test cases
Force-pushed: 30d4a00 to 761b7b6
The case when DPU(s) physically exist on a host: they are assigned a machine interface, but they are not part of the Managed Host. This PR skips DPU configuration, which allows DPUs to stay outside the machine's lifecycle workflow. Signed-off-by: Vinod Chitrali <vchitrali@nvidia.com>
Force-pushed: 761b7b6 to c903c8c
@vinodchitraliNVIDIA this is not the case: systems can have DPUs in what is called "NIC mode", which means they show up as a "dumb NIC" on the device and are not something we put any of our code on. This is a BIOS setting you have to set on the system. If it's set, we don't count it as a "DPU" on the device; it's just a regular NIC. So the zero-DPU code paths apply if the only DPUs on the host are NIC-mode. But a NIC-mode DPU means nobody else can put code on it either... so putting a DPU in NIC mode means it can't be DPF-managed either.
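The classification rule described above can be sketched in Rust. This is illustrative only: `DeviceKind` and `is_zero_dpu_host` are hypothetical names, not carbide code; the point is just that a NIC-mode DPU counts as a plain NIC, so a host qualifies for the zero-DPU path only when none of its devices is a manageable DPU.

```rust
/// Hypothetical device classification, assumed for illustration.
#[derive(Clone, Copy, PartialEq)]
enum DeviceKind {
    Nic,          // ordinary NIC
    Dpu,          // DPU that carbide would normally manage
    DpuInNicMode, // DPU with the NIC-mode BIOS setting: treated as a plain NIC
}

/// A host takes the zero-DPU path when none of its devices is a
/// *manageable* DPU; NIC-mode DPUs count as plain NICs.
fn is_zero_dpu_host(devices: &[DeviceKind]) -> bool {
    devices.iter().all(|d| *d != DeviceKind::Dpu)
}
```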
        return Ok(false);
    }
    tracing::info!("Created managed_host with zero DPUs");
} else if self.config.use_onboard_nic.load(Ordering::Relaxed) {
I still don't understand what this change is supposed to do. If we've gotten to this point in site_explorer, we've seen no DPUs on the host, and if zero-DPU configuration is allowed, we already ingest it with no DPUs.
Why do we need another config setting called use_onboard_nic, and another function create_onboard_nic_machine? What do these do that are different from the zero-dpu path?
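For illustration only, here is a minimal Rust sketch of how the two flag-gated branches might relate. The `use_onboard_nic` flag and the `Ordering::Relaxed` load come from the diff in this PR; everything else (`ExplorerConfig`, `IngestPath`, `choose_path`, and an assumed `allow_zero_dpu` flag) is a hypothetical name invented here, not carbide code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical slice of the site_explorer config; the real struct is larger.
struct ExplorerConfig {
    allow_zero_dpu: AtomicBool,  // assumed flag for the existing zero-DPU path
    use_onboard_nic: AtomicBool, // flag added by this PR
}

// Which ingestion path a discovered host takes (illustrative variants).
#[derive(Debug, PartialEq)]
enum IngestPath {
    ManagedDpus,       // normal path: carbide provisions the DPUs
    ZeroDpu,           // no manageable DPUs visible on the host
    ExternallyManaged, // DPUs present but skipped; host boots from its NIC
}

fn choose_path(cfg: &ExplorerConfig, dpu_count: usize) -> IngestPath {
    if dpu_count == 0 && cfg.allow_zero_dpu.load(Ordering::Relaxed) {
        IngestPath::ZeroDpu
    } else if cfg.use_onboard_nic.load(Ordering::Relaxed) {
        // The PR's new branch: DPUs exist but their configuration is skipped.
        IngestPath::ExternallyManaged
    } else {
        IngestPath::ManagedDpus
    }
}
```

The reviewer's question amounts to: the first two branches overlap when `dpu_count == 0`, so what does the second branch provide that the first does not?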
@kensimon is there any known bug? A couple of months ago I tried with the zero-DPU flag and it didn't work: the GB200 machine has multiple MAC addresses, and I hit an issue there. Also, the DPU list in the managed host is not empty.
Not that I know of? If there is a bug that you found, we should fix it. I don't think we need a fully separate code path for GB200s when we can just fix the current one.
If you have an issue with GB200s in zero-DPU mode, could you file an nvbug and assign it to me? I need logs/details/reproduction steps if possible.
@vinodchitraliNVIDIA can you schedule a meeting to go over how this is different from the zero-DPU code path? Just like 15 minutes to explain what issues you're running into. Zero-DPU ideally should work with any type of NIC on the machine.
Sure, will do. Let me set up an env with zero DPU using the latest code from GitHub.
Description
The case when DPU(s) physically exist on a host: they are assigned a machine interface, but they are not part of the Managed Host. This PR skips DPU configuration and forces the host to boot from the NIC.
Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes